Load everything required in the console

Load the airlines_dataset.xlsx data file using pandas.read_excel

Do the same for the CSV version using pandas.read_csv
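A minimal sketch of the two loading calls. The filenames match the notes; here a small in-memory sample stands in for the real file so the snippet is runnable on its own:

```python
import io
import pandas as pd

# In the notebook the raw file would be read directly, e.g.:
#   df = pd.read_excel("airlines_dataset.xlsx")   # Excel version
#   df = pd.read_csv("airlines_dataset.csv")      # CSV version (hypothetical filename)
# A small in-memory sample stands in for the file here so the sketch runs.
sample_csv = io.StringIO(
    "Airline,Date_of_Journey,Source,Destination,Price\n"
    "IndiGo,24/03/2019,Banglore,New Delhi,3897\n"
    "Air India,1/05/2019,Kolkata,Banglore,7662\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)  # (2, 5)
```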

Examine the first 5 rows of the data using .head()

Examine the last 5 rows of the data using .tail()

df.shape returns a tuple representing the dimensionality of the DataFrame.

(rows, columns)
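The three inspection calls above, sketched on a toy frame (the real file is not loaded here):

```python
import pandas as pd

# Toy frame standing in for the airline data.
df = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways", "SpiceJet", "Vistara", "GoAir"],
    "Price": [3897, 7662, 13882, 6218, 13302, 5000],
})

first5 = df.head()     # first 5 rows
last5 = df.tail()      # last 5 rows
rows, cols = df.shape  # dimensionality as a (rows, columns) tuple
print(rows, cols)      # 6 2
```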

Basic information about the dataset using .info()

Calculate summary statistics of all columns: df.describe(include='all')

The default call without parameters, df.describe(), gives summary statistics of numeric columns only.

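The difference between the two describe calls, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"],
                   "Price": [3897, 7662, 4500]})

df.info()  # prints dtypes, non-null counts, and memory usage (returns None)

# include='all' covers object columns too (adds unique/top/freq rows);
# the default describes only the numeric columns.
all_stats = df.describe(include="all")
num_stats = df.describe()
print(num_stats.columns.tolist())  # ['Price']
```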

Short introduction about dataset:

The dataset contains airline data of Indian Airways from 2019, which is verified in the cells below.

The dataset has 10683 rows & 11 columns in its raw state.

The number of columns is expected to increase after encoding during pre-processing, and the number of rows is expected to decrease once duplicate data is removed.

The Airline column contains the names of the airlines that flew the given routes over the span of a year (2019).

Date_of_Journey indicates the date the flight departed from the Source to the Destination along the Route given in the respective columns of the dataset.

Dep_Time is the departure time of the plane, stored as a time value, whereas Arrival_Time contains the time as well as the date when the flight arrives on the following day.

The Total_Stops column can be derived from the number of airports along the Route the plane flew.
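A sketch of that derivation, assuming Route strings in the dataset's "A → B → C" style: the number of stops is the number of intermediate airports.

```python
import pandas as pd

# Hypothetical Route values: a non-stop flight and a two-stop flight.
routes = pd.Series(["BLR → DEL", "CCU → IXR → BBI → BLR"])

# Airports on the route minus the two endpoints = number of stops.
stops = routes.str.split("→").str.len() - 2
print(stops.tolist())  # [0, 2]
```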

There are only two NaN values, and the Additional_Info column carries no information in more than 50% of the rows. When present, the column usually describes food service, baggage details, and layover details.

Price is the final column; it indicates the price of the airline's service and can also be used as the variable to predict.

Airline

Categorical data

Plot a pie chart of Airline.value_counts()

figsize=(10, 10) => size of the figure to plot

labeldistance=None => removes the label on each wedge

legend=True => gives that beautiful labelled legend box with colours
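The pie-chart call with those three parameters, sketched on toy counts (the Agg backend is only there so the snippet runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

counts = pd.Series(["IndiGo", "IndiGo", "Jet Airways", "Air India"]).value_counts()

# Same call shape as in the notebook: a pie chart of Airline.value_counts().
ax = counts.plot.pie(figsize=(10, 10),    # figure size to plot
                     labeldistance=None,  # removes the label on each wedge
                     legend=True)         # labelled, colour-coded legend box
plt.close("all")
```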

Date_of_Journey

Source and Destination

Routes and Stops

Additional_Info

Since 78% of the data in this column is empty or consists of "No info", which is of no use, it is safe to drop the column; the data that does exist is not crucial.
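Dropping the column is a one-liner; a runnable sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"Airline": ["IndiGo", "Air India"],
                   "Additional_Info": ["No info", "No info"],
                   "Price": [3897, 7662]})

# ~78% of Additional_Info is empty or "No info", so the column is dropped.
df = df.drop(columns=["Additional_Info"])
print(df.columns.tolist())  # ['Airline', 'Price']
```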

Departure Time 'Dep_Time', 'Arrival_Time' and 'Duration'

df.Arrival_Time

df.Duration
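One way to turn the three time columns into numeric features, assuming the raw formats seen in this dataset ("22:20", "01:10 22 Mar", "2h 50m"); this is a sketch, not the notebook's exact code:

```python
import pandas as pd

df = pd.DataFrame({"Dep_Time": ["22:20", "05:50"],
                   "Arrival_Time": ["01:10 22 Mar", "13:15"],
                   "Duration": ["2h 50m", "7h 25m"]})

# Split the clock string into numeric hour/minute features.
df["Dep_Hour"] = df["Dep_Time"].str.split(":").str[0].astype(int)
df["Dep_Min"] = df["Dep_Time"].str.split(":").str[1].astype(int)

# Arrival_Time may carry a trailing date ("01:10 22 Mar"); keep only the clock part.
arr_clock = df["Arrival_Time"].str.split(" ").str[0]
df["Arr_Hour"] = arr_clock.str.split(":").str[0].astype(int)

# Duration like "2h 50m" -> total minutes (a missing part counts as 0).
hours = df["Duration"].str.extract(r"(\d+)h")[0].fillna(0).astype(int)
mins = df["Duration"].str.extract(r"(\d+)m")[0].fillna(0).astype(int)
df["Duration_Min"] = hours * 60 + mins
print(df["Duration_Min"].tolist())  # [170, 445]
```

Once these numeric columns exist, the original string columns become redundant and can be dropped, as the next step does.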

Dropping redundant columns

Price

Removing duplicate data
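A sketch of the de-duplication step on a toy frame (the row count drops, as the introduction predicted):

```python
import pandas as pd

df = pd.DataFrame({"Airline": ["IndiGo", "IndiGo", "Air India"],
                   "Price": [3897, 3897, 7662]})

before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
print(before, "->", len(df))  # 3 -> 2
```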

Handling categorical data

The data in the Airline column is nominal since the order of the values does not matter. The data is handled with one-hot encoding.
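A minimal sketch of one-hot encoding the nominal column with pandas.get_dummies:

```python
import pandas as pd

df = pd.DataFrame({"Airline": ["IndiGo", "Air India", "IndiGo"],
                   "Price": [3897, 7662, 4500]})

# Nominal data: one-hot encode, producing one 0/1 column per airline.
encoded = pd.get_dummies(df, columns=["Airline"])
print(encoded.columns.tolist())  # ['Price', 'Airline_Air India', 'Airline_IndiGo']
```

This is why the column count grows after pre-processing: each category becomes its own column.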

Finding Outliers
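One common way to flag outliers in a numeric column like Price is the 1.5 × IQR rule (one of several possible definitions; the notebook may use a different one):

```python
import pandas as pd

prices = pd.Series([3897, 7662, 13882, 6218, 79512])  # last value is an obvious outlier

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [79512]
```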

Feature selection

Rearranging the columns

Selecting important features using ExtraTreesRegressor

Here, Total_Stops is the most important feature.
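A sketch of the importance check on synthetic data, where Total_Stops is constructed to drive the target so it comes out on top (the real notebook fits on the actual dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: Total_Stops drives the price, Dep_Hour is noise.
X = pd.DataFrame({"Total_Stops": rng.integers(0, 4, 200),
                  "Dep_Hour": rng.integers(0, 24, 200)})
y = X["Total_Stops"] * 4000 + rng.normal(0, 100, 200)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.idxmax())  # Total_Stops
```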

Profiling report

Data Wrangling

Data wrangling is the process of transforming data from its original "raw" form into a more readily usable format. It prepares the data for analysis. It is also known as data cleaning, data remediation, or data munging.

It can be either a manual or an automated process. When the dataset is immense, manual data wrangling becomes very tedious and calls for automation.

It consists of six steps:

Step 1: Discovering. In this step, the data is understood more deeply; it is a way of familiarizing yourself with the data before passing it on to the later steps. During this phase, patterns in the data can be identified, along with issues in the dataset. Values that are unnecessary, missing, or incomplete are identified for addressing.

Step 2: Structuring. Raw data is very likely to be haphazard and unstructured, and it needs to be restructured properly. Data is reorganized for easier computation as well as analysis.

Step 3: Cleaning. In this step, data is cleaned for high-quality analysis. Null values are changed and formatting is standardized. It also includes deleting empty rows and removing outliers, ensuring there are as few errors as possible.

Step 4: Enriching. In this step, it is determined whether all the data necessary for the project is present. If not, the dataset is extended by merging with another dataset or simply incorporating values from other datasets. If the data is complete, the enriching part is optional. Once new data is added from another dataset, the steps of discovering, structuring, cleaning, and enriching need to be repeated.

Step 5: Validating. In this step, the state of the data (its consistency and quality) is verified. If no issues remain to be resolved, the data is ready to be analysed.

Step 6: Publishing. In this final step, the validated data is published. It can be published in different file formats, ready for analysis by an organisation or an individual.

Sources: https://www.trifacta.com/data-wrangling/

https://online.hbs.edu/blog/post/data-wrangling